UIMA SDK Overview

IBM’s Unstructured Information Management Architecture (UIMA) is an architecture and software framework for creating, discovering, composing and deploying a broad range of multi-modal analysis capabilities and integrating them with search technologies.

The UIMA framework provides a run-time environment in which developers can plug in and run their UIMA component implementations and with which they can build and deploy UIM applications. The framework is not specific to any IDE or platform.

The UIMA Software Development Kit (SDK) includes an all-Java implementation of the UIMA framework for the development, description, composition and deployment of UIMA components and applications. It also provides the developer with an Eclipse-based (www.eclipse.org) development environment that includes a set of tools and utilities for using UIMA.

This chapter is the intended starting point for readers who are new to the UIMA SDK. It includes this introduction and the following sections:

Section 1.1, UIMA SDK Documentation Overview provides a list of the chapters included in the UIMA SDK documentation with a brief summary of each.
Section 1.2, Using the Documentation to get started with the UIMA SDK describes a recommended path through the documentation to help get the reader up and running with UIMA.
Section 1.3, What's new in Version 2.0 describes the main new capabilities in this version of the UIMA SDK.

Chapter	Description
Overviews
UIMA SDK Overview (This Chapter)	Lists the documents provided in the UIMA SDK documentation set. Provides a recommended path through the documentation for getting started using UIMA. Includes release notes. Provides a brief high-level description of the different software modules included in the UIMA SDK.
UIMA Conceptual Overview	Provides a broad conceptual overview of the UIMA component architecture making contextual references to the other documents in the UIMA SDK documentation set that provide more detail.
Setting up
UIMA Eclipse Tooling Installation and Setup	Provides step-by-step instructions for installing the UIMA SDK in the Eclipse Integrated Development Environment.
Developer's Guides
Annotator and AE Developer's Guide	Tutorial-style guide for building UIMA annotators and analysis engines. This chapter introduces the developer to creating type systems and using UIMA’s common data structure, the CAS or Common Analysis Structure. It demonstrates how to use built-in tools to specify and create basic UIMA analysis components.
CPE Developer's Guide	Tutorial-style guide for building UIMA collection processing engines. These manage the analysis of collections of documents from source to sink.
Application Developer's Guide	Tutorial-style guide for using the UIMA SDK to create, run and manage UIMA components from your application. Includes integration with the semantic search engine and a description of a simple GUI provided for submitting and running Semantic Search queries that can exploit UIMA analysis. Also describes APIs for saving and restoring the contents of a CAS using an XML format called XCAS.
Flow Controller Developer's Guide	When multiple components are combined in an Aggregate, each CAS flows among the various components. UIMA provides two built-in flows, and also allows custom flows to be implemented.
Developing Applications using Multiple Subjects of Analysis (Sofas)	A single CAS may be associated with multiple subjects of analysis (Sofas). These are useful for representing and analyzing different formats or translations of the same document. For multi-modal analysis, Sofas are good for different modal representations of the same stream (e.g., audio and closed captions). This chapter provides the developer with details on how to use multiple Sofas in an application.
CAS Multiplier Developer's Guide	A component may add additional CASes into the workflow. This may be useful to break up a large artifact into smaller units, or to create a new CAS that collects information from multiple other CASes.
XMI® and EMF Interoperability	The UIMA Type system and the contents of the CAS itself can be externalized using the XMI (XML Metadata Interchange) standard. Eclipse Modeling Framework (EMF) tooling can be used to develop applications that use this information.
Tool User Guides
Component Descriptor Editor	Describes the features of the Component Descriptor Editor Tool. This tool provides a GUI for specifying the details of UIMA component descriptors, including those for Analysis Engines (primitive and aggregate), Collection Readers, CAS Consumers and Type Systems.
CPE Configurator	Describes the User Interfaces and features of the CPE Configurator tool. This tool allows the user to select and configure the components of a Collection Processing Engine and then to run the engine.
PEAR Packager	Describes how to use the PEAR Packager utility. This utility enables developers to produce an archive file for an analysis engine that includes all required resources for installing that analysis engine in another UIMA environment.
PEAR Installer	Describes how to use the PEAR Installer utility. This utility installs and verifies an analysis engine from an archive file (PEAR) with all its resources in the right place so it is ready to run.
PEAR Merger User's Guide	Merges multiple PEAR packages into one.
Document Analyzer	Describes the features of a tool for applying a UIMA analysis engine to a set of documents and viewing the results.
CAS Visual Debugger	Describes the features of a tool for viewing the detailed structure and contents of a CAS. Good for debugging.
JCasGen	Describes how to run the JCasGen utility, which automatically builds Java classes that correspond to a particular CAS Type System.
XCAS Viewer	Describes how to run the supplied viewer for XCASes, used in the examples.
References
UIMA FAQs	Frequently Asked Questions about general UIMA concepts. (Not a programming resource.)
Glossary	Main UIMA concepts and their basic definitions.
Component Descriptor Reference	Provides detailed XML format for all the UIMA component descriptors, except the CPE (see next).
CPE Descriptor Reference	Provides detailed XML format for the Collection Processing Engine descriptor.
JavaDocs	JavaDocs detailing the UIMA SDK programming interfaces.
CAS Reference	Provides detailed description of the principal CAS interface.
JCas Reference	Provides details on the JCas, a native Java interface to the CAS.
Semantic Search Engine Reference	Describes how to write applications that query a semantic search engine index built using the UIMA SDK.
PEAR Reference	Provides detailed description of the deployable archive format for UIMA components.
XMI CAS Serialization Reference	Provides details about the XMI CAS Serialization.

Explore this chapter to get an overview of the different documents that are included with the SDK.
Read Chapter 2, UIMA Conceptual Overview to get a broad view of the basic UIMA concepts and philosophy with reference to the other documents included in the SDK which provide greater detail.
For more general information on the UIMA architecture and how it has been used, refer to the IBM Systems Journal special issue on Unstructured Information Management, online at http://www.research.ibm.com/journal/sj43-3.html, or to the external UIMA website, where key publications are listed at http://www.research.ibm.com/UIMA/pubs.htm.
Set up the UIMA SDK in your Eclipse environment. To do this, follow the instructions in Chapter 3, UIMA SDK Setup for Eclipse.
Develop sample UIMA annotators, run them and explore the results. Read Chapter 4, Annotator and Analysis Engine Developer’s Guide and follow it like a tutorial to learn how to develop your first UIMA annotator and set up and run your first UIMA analysis engines.
- As part of this you will use a few tools including
  - The UIMA Component Descriptor Editor, described in more detail in Chapter 12, Component Descriptor Editor User’s Guide and
  - The Document Analyzer, described in more detail in Chapter 17, Document Analyzer User's Guide
- While following along in Chapter 4, Annotator and Analysis Engine Developer’s Guide, reference documents that may help are:
  - Chapter 23 for understanding the analysis engine descriptors
  - Chapter 27, JCas Reference for understanding the JCas
Learn how to create, run and manage a UIMA analysis engine as part of an application. Connect your analysis engine to the provided semantic search engine to learn how a complete analysis and search application may be built with the UIMA SDK. Chapter 6, Application Developer’s Guide will guide you through this process.
- As part of this you will use the document analyzer (described in more detail in Chapter 17, Document Analyzer User's Guide) and semantic search GUI tools (described in section 6.5.2, Semantic Search Query Tool).
Pat yourself on the back. Congratulations! If you reached this step successfully, then you have an appreciation for the UIMA analysis engine architecture. You will have built a few sample annotators, deployed UIMA analysis engines to analyze a few documents, searched over the results using the built-in semantic search engine and viewed the results through a built-in viewer, all as part of a simple but complete application.
Develop and run a Collection Processing Engine (CPE) to analyze and gather the results of an entire collection of documents. Chapter 5, Collection Processing Engine Developer's Guide will guide you through this process.
- As part of this you will use the CPE Configurator tool. For details see Chapter 13, Collection Processing Engine Configurator User's Guide.
- You will also learn about CPE Descriptors. The detailed format for these may be found in Chapter 24, Collection Processing Engine Descriptor Reference.

Learn how to package up an analysis engine for easy installation into another UIMA environment. Chapter 14, PEAR Packager and Chapter 15, PEAR Installer User's Guide will teach you how to create UIMA analysis engine archives so that you can easily share your components with a broader community.

Version 2.0 provides new capabilities and refines several areas of the UIMA architecture.

New Capabilities

New Primitive data types

UIMA now supports Boolean (bit), Byte, Short (16 bit integers), Long (64 bit integers), and Double (64 bit floating point) primitive types, and arrays of these. These types can be used like all the other primitive types.
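
As a rough illustration of the bit widths listed above, the following sketch reports the sizes of the Java primitives that share these names. It assumes, without confirmation from this chapter, that each UIMA primitive type lines up with the like-named Java primitive. The class name PrimitiveWidths is hypothetical and nothing here calls the UIMA API. Boolean is left out because Java does not define a bit width for it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the UIMA SDK: reports the bit widths of
// the Java primitives assumed to correspond to the new UIMA primitive types.
class PrimitiveWidths {

    // Maps a primitive type name to its width in bits, in the order used in the text.
    static Map<String, Integer> widths() {
        Map<String, Integer> bits = new LinkedHashMap<>();
        bits.put("Byte", Byte.SIZE);
        bits.put("Short", Short.SIZE);
        bits.put("Long", Long.SIZE);
        bits.put("Double", Double.SIZE);
        return bits;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : widths().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue() + " bits");
        }
        // The widths determine the value ranges a feature of that type can hold.
        System.out.println("Short range: " + Short.MIN_VALUE + " to " + Short.MAX_VALUE);
        System.out.println("Long range: " + Long.MIN_VALUE + " to " + Long.MAX_VALUE);
    }
}
```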

Simpler Analysis Engines and CASes

Version 1.x made a distinction between Analysis Engines and Text Analysis Engines. This distinction has been eliminated in Version 2 - new code should just refer to Analysis Engines. Analysis Engines can operate on multiple kinds of artifacts, including text.

Version 1.x made a distinction between CASes and TCASes. TCASes are now deprecated; new code should just refer to CASes. The JCas capability to have a Java-friendly way to work with CAS types remains; we clarify that the JCas is just one of potentially several interfaces to the CAS.

Sofas and CAS Views simplified

The APIs for manipulating multiple subjects of analysis (Sofas) and their corresponding CAS Views have been simplified.

Analysis Component generalized to support multiple new CAS outputs

Analysis Components, in general, can make use of new capabilities to return multiple new CASes, in addition to returning the original CAS that is passed in. This allows components to have Collection Reader-like capabilities, but be placed anywhere in the flow. See CAS Multiplier Developer's Guide.
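
The idea of one input producing several outputs can be sketched in plain Java. In the sketch below a String stands in for the original CAS and each returned String stands in for a new CAS; the class and method names are hypothetical and no UIMA classes are used, so this shows only the shape of the technique (breaking a large artifact into smaller units), not the SDK's actual interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CAS Multiplier: the input String plays the role
// of the original CAS and each returned String plays the role of a new CAS.
class SegmentingMultiplierSketch {

    // Splits the document text into consecutive segments of at most maxChars characters.
    static List<String> segment(String document, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> segments = new ArrayList<>();
        for (int start = 0; start < document.length(); start += maxChars) {
            int end = Math.min(start + maxChars, document.length());
            segments.add(document.substring(start, end));
        }
        return segments;
    }

    public static void main(String[] args) {
        // One input artifact yields several smaller output artifacts.
        for (String s : segment("abcdefghij", 4)) {
            System.out.println(s);
        }
    }
}
```

Because the segments are returned in addition to (not instead of) the input, a downstream step could still process the original artifact as a whole.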

User-customized flow controllers

A new component, the Flow Controller, can be supplied by the user to implement arbitrary flow control for CASes within an Aggregate. This is in addition to the two built-in flow control choices of linear and language-capability flow. See Flow Controller Developer's Guide.
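
As one illustration of a custom flow, the sketch below follows the whiteboard idea used by the SDK's sample flow controller: the CAS goes to any annotator that has not yet processed it, once that annotator's inputs are available. This is a plain-Java model under stated assumptions. The class, the method names and the type names are hypothetical, annotators are reduced to sets of input and output type names, and no UIMA flow controller interface is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of whiteboard-style routing; not the SDK's flow controller API.
class WhiteboardFlowSketch {

    // inputs: annotator name -> type names it needs; outputs: annotator name -> type names it adds.
    // Returns the order in which annotators receive the CAS.
    static List<String> route(Map<String, Set<String>> inputs,
                              Map<String, Set<String>> outputs) {
        Set<String> available = new HashSet<>();   // types currently present in the CAS
        List<String> order = new ArrayList<>();    // annotators that have already run
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Map.Entry<String, Set<String>> e : inputs.entrySet()) {
                String name = e.getKey();
                if (!order.contains(name) && available.containsAll(e.getValue())) {
                    order.add(name);
                    available.addAll(outputs.getOrDefault(name, Collections.<String>emptySet()));
                    progressed = true;
                }
            }
        }
        return order;
    }

    // Small example: the meeting finder must wait for room numbers and dates.
    static List<String> demoOrder() {
        Map<String, Set<String>> inputs = new LinkedHashMap<>();
        inputs.put("MeetingFinder", new HashSet<>(Arrays.asList("RoomNumber", "DateTime")));
        inputs.put("RoomNumber", new HashSet<String>());
        inputs.put("DateTime", new HashSet<String>());
        Map<String, Set<String>> outputs = new LinkedHashMap<>();
        outputs.put("RoomNumber", Collections.singleton("RoomNumber"));
        outputs.put("DateTime", Collections.singleton("DateTime"));
        outputs.put("MeetingFinder", Collections.singleton("Meeting"));
        return route(inputs, outputs);
    }

    public static void main(String[] args) {
        System.out.println(demoOrder()); // prints [RoomNumber, DateTime, MeetingFinder]
    }
}
```

In this model an annotator whose inputs never become available is simply never selected, which is one possible policy; a real flow would need to decide how to handle that case.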

Search Engine updated with new capability to index Annotation feature values

The search engine that is provided with the UIMA SDK has been upgraded to a later release; it is more scalable and now has the ability to index additional information from Annotations. The SIAPI.pdf reference documentation for this has been updated. The SemanticSearchCasIndexer now supports indexing individual features of annotations in addition to their types.

Backwards Compatibility

For the most part, applications and components should work unchanged under version 2.0. However, please note the following non-compatible changes:

The format for indexes produced by the SemanticSearchCasIndexer has changed. Indexes that were generated using the v1.x SDK cannot be read with v2.0. You must reindex your content in v2.0.
There have been some changes to ResultSpecifications. We do not guarantee 100% backwards compatibility for applications that made use of them, although most cases should work.
For applications that deal with multiple subjects of analysis (Sofas), the rules that determine whether a component is Multi-View or Single-View have been made more consistent. A component is considered Multi-View if and only if it declares at least one inputSofa or outputSofa in its descriptor. This leads to the following incompatibilities in unusual cases:
- It is an error if an annotator that implements the TextAnnotator or JTextAnnotator interface also declares inputSofas or outputSofas in its descriptor. Such annotators must be Single-View.
- Annotators that implement GenericAnnotator but do not declare any inputSofas or outputSofas will now be passed the view of the default Sofa instead of the Base CAS.
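
The Multi-View rule and the first incompatibility above can be written as a small predicate. The sketch below is a hypothetical model only: it takes the declared Sofa names as plain lists rather than reading a real descriptor, the class and method names are invented for illustration, and "EnglishText" is a made-up Sofa name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the rule stated above; it does not read real UIMA
// descriptors and uses no UIMA classes.
class ViewModeSketch {

    // A component is Multi-View if and only if its descriptor declares
    // at least one inputSofa or outputSofa.
    static boolean isMultiView(List<String> inputSofas, List<String> outputSofas) {
        return !inputSofas.isEmpty() || !outputSofas.isEmpty();
    }

    // Annotators implementing TextAnnotator or JTextAnnotator must be Single-View,
    // so declaring any Sofa is reported as an error.
    static void checkTextAnnotator(List<String> inputSofas, List<String> outputSofas) {
        if (isMultiView(inputSofas, outputSofas)) {
            throw new IllegalStateException(
                "TextAnnotator/JTextAnnotator must not declare inputSofas or outputSofas");
        }
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        System.out.println(isMultiView(none, none));                         // false
        System.out.println(isMultiView(Arrays.asList("EnglishText"), none)); // true
        checkTextAnnotator(none, none); // passes: no Sofas declared
    }
}
```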

Other changes

TextAnalysisEngine has been deprecated - it is now no different than AnalysisEngine. Previous code that uses this should still continue to work, however.

Methods that were defined on the TCAS interface have been moved to the base CAS interface; the TCAS interface is no longer needed.

The DocumentAnalyzer tool saves outputs in the new XMI serialization format. The XCasAnnotationViewer and SemanticSearchGUI tools can read both the new XMI format and the previous XCAS format.

General

The UIMA SDK supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies.

It includes APIs and tools for creating analysis components. Examples of analysis components include tokenizers, summarizers, categorizers, parsers, named-entity detectors etc. Tutorial examples are provided with the SDK; additional components are available from the community.

The UIMA SDK also includes a semantic search engine for indexing the results of analysis and for using this semantic index to perform more advanced search.

Programming Language Support

UIMA supports the development and integration of analysis algorithms developed in different programming languages.

The SDK is principally focussed on Java development. It also includes facilities for C++ Enablement for UIMA Components which allow UIMA components to be written in C++ and have access to a C++ version of the CAS. When used in this manner, the Java UIMA framework can incorporate analytic functions written in C++. Optional files included with the UIMA SDK describe this functionality and provide example code. See the Quick Start manual for more information on this.

Other languages, including Python, Perl, and TCL, are being added to the list.

Multi-Modal Support

The UIMA architecture supports the development, discovery, composition and deployment of multi-modal analytics, including text, audio and video. Annotations, Artifacts, and S discuss this in more detail.

Availability and Open Source

The SDK is available from IBM's alphaWorks (http://www.alphaworks.ibm.com/tech/uima). The source code for the main UIMA framework is available on SourceForge (http://uima-framework.sourceforge.net).

Module	Description
UIMA Framework Core	A framework integrating core functions for creating, deploying, running and managing UIMA components, including analysis engines and Collection Processing Engines in collocated and/or distributed configurations. The framework includes an implementation of core components for transport layer adaptation, CAS management, workflow management based on declarative specifications, resource management, configuration management, logging, and other functions.
C++ and other programming language Interoperability	Includes C++ CAS and supports the creation of UIMA compliant C++ components that can be deployed in the UIMA run-time through a built-in JNI adapter. This includes high-speed binary serialization. Includes support for creating service-based UIMA engines outside of the SDK. This is ideal for wrapping existing code written in different languages.
Externalized Framework Plug-ins	Note that interfaces of these components are available to the developer but different implementations are possible in different implementations of the UIMA framework.
CAS	These classes provide the developer with typed access to the Common Analysis Structure (CAS), including type system schema, elements, subjects of analysis and indices. The multiple subjects of analysis (Sofas) mechanism supports the independent or simultaneous analysis of multiple views of the same artifacts (e.g. documents), supporting multi-lingual and multi-modal analysis.
JCas	An alternative interface to the CAS, providing Java-based UIMA Analysis components with native Java object access to CAS types and their attributes or features, using the JavaBeans conventions of getters and setters.
Collection Processing Management (CPM)	Core functions for running UIMA collection processing engines in collocated and/or distributed configurations. The CPM provides scalability across parallel processing pipelines, check-pointing, performance monitoring and recoverability.
Resource Manager	Provides UIMA components with run-time access to external resources, handling capabilities such as resource naming, sharing, and caching.
Configuration Manager	Provides UIMA components with run-time access to their configuration parameter settings.
Logger	Provides access to a common logging facility.
Tools and Utilities
JCasGen	Utility for generating a Java object model for CAS types from a UIMA XML type system definition.
Saving and Restoring CAS contents	APIs in the core framework support saving and restoring the contents of a CAS to streams using an XMI format.
PEAR packager for Eclipse	Tool for building a UIMA component archive to facilitate porting, registering, installing and testing components.
PEAR Installer	Tool for installing and verifying a UIMA component archive in a UIMA installation.
PEAR Merger	Utility that combines multiple PEARs into one.
Component Descriptor Editor	Eclipse Plug-in for specifying and configuring component descriptors for UIMA analysis engines as well as other UIMA component types including Collection Readers and CAS Consumers.
CPE Configurator	Graphical tool for configuring Collection Processing Engines and applying them to collections of documents.
Java Annotation viewer	Viewer for exploring annotations and related CAS data.
CAS Visual Debugger	Provides developer with detailed visual view of the contents of a CAS.
Document Analyzer	Graphical tool for applying analysis engines to sets of documents and viewing results.
Example Analysis Components
Semantic Search CAS Indexer	CAS Consumer that uses the semantic search engine indexer to build an index from a stream of CASes. Requires the semantic search engine (included).
Database Writer	CAS Consumer that writes the content of selected CAS types into a relational database, using JDBC. This code is in the doc/examples/src/com/ibm/uima/examples/cpe/PersonTitleDBWriterCasConsumer.
Annotators	Set of simple annotators meant for pedagogical purposes. Includes: Date/time, Room-number, Regular expression, Tokenizer, and Meeting-finder annotator. There are also sample Annotators in C++ and Python. There are sample CAS Multipliers as well.
Flow Controllers	There is a sample flow-controller based on the whiteboard concept of sending the CAS to whatever annotator hasn't yet processed it, when that annotator's inputs are available in the CAS.
File System Collection Reader	Simple Collection Reader for pulling documents from the file system and initializing CASes.
XMI Collection Reader, Cas Consumer	Reads and writes the CAS in XMI format
Search Components
Semantic Search Engine	Search Engine that supports searching over results of analysis including annotations and nested annotations using the "XML Fragment" query language.
Components not currently available in this release of the UIMA SDK.	If interested in these extensions please contact the UIMA team at IBM T.J. Watson Research Center via www.ibm.com/research/uima
Semantic search and Analysis Workbench (SAW)	Graphical User Interface for applying analysis to build search indices and DBs and query interfaces for searching/exploring analysis results. Uses the semantic search engine and the EKDB (see below).
Extracted Knowledge Database (EKDB)	Database schema and APIs for creating and populating a relational database with analysis results including entity and relation annotations. Includes a CAS Consumer that populates the database. Semantic Analysis Workbench provides a front-end to this database and to the Semantic Search Engine’s query processor.

UIMA SDK Capabilities